Reliability Markov models are becoming unreliable (WIP submission)
Abstract
Markov models have traditionally been used to understand the reliability of storage systems. They provide intuition about the sensitivity of storage system reliability to changes in disk failure rates, rebuild rates, sector failure rates, scrubbing rates, and storage capacity. Unfortunately, as we move towards multi-disk fault tolerant storage systems, i.e., storage systems that tolerate two or more disk failures, such as RAID 6, reliability estimates based on traditional Markov models become unreliable. Our concerns go beyond the recent demonstration that Weibull distributions need to be used instead of exponential distributions to accurately determine storage system reliability [1]. We believe that the traditional construction of Markov models is flawed for multi-disk fault tolerant systems, and that their accuracy and utility decrease as the redundancy in the system increases.
In this WIP, we will only discuss one of our concerns: modeling disk rebuild correctly. Two traditional Markov models are used to model two distinct storage rebuild policies. In a serial rebuild policy, a storage system rebuilds the first failed disk in its entirety before rebuilding the next failed disk, and so on. In a concurrent rebuild policy, a storage system begins rebuilding failed disks as they fail.
Figure 1 illustrates the two traditional Markov models for an n-disk system that tolerates m disk failures. The label of each state indicates the number of failed disks; state m + 1 is the data loss state. The transitions from left to right are disk failures, with λ being the failure rate. The transitions from right to left are disk rebuilds, with μ being the rebuild rate. For single disk fault tolerant systems, the serial and concurrent rebuild models are identical, and are correct. For multi-disk fault tolerant systems, both rebuild models are incorrect. The same modeling error is made in each case.
The rebuild transitions for states 2 through m are incorrect: they model the rebuild of the disk that failed most recently, whereas reliability is dominated by the rebuild of the disk that failed earliest. In essence, traditional Markov models reset the rebuild time for all disks being rebuilt whenever another disk fails. The traditional serial rebuild Markov model thus models a rebuild policy ...
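As a rough illustration of the traditional models the abstract criticizes (not a corrected model), the sketch below computes the mean time to data loss (MTTDL) of the birth-death chain with states 0 through m + 1. The per-state rates are assumptions that follow the usual textbook construction, since Figure 1 is not reproduced here: failures leave state i at rate (n - i)λ, and rebuilds leave state i at rate μ under the serial policy or iμ under the concurrent policy. The function name `traditional_mttdl` and the sample parameter values are hypothetical.

```python
def traditional_mttdl(n, m, lam, mu, policy="serial"):
    """MTTDL of the traditional Markov model for n disks tolerating m failures.

    Assumed rates (not taken from the source figure):
      state i -> i + 1 at rate (n - i) * lam   (disk failure)
      state i -> i - 1 at rate mu              (serial rebuild)
                       or i * mu               (concurrent rebuild)
    State m + 1 is the absorbing data loss state.
    """
    if policy not in ("serial", "concurrent"):
        raise ValueError("policy must be 'serial' or 'concurrent'")
    if not 0 <= m < n:
        raise ValueError("need 0 <= m < n")
    if lam <= 0 or mu < 0:
        raise ValueError("need lam > 0 and mu >= 0")

    mttdl = 0.0
    hop = 0.0  # expected time of the previous hop, from state i - 1 to state i
    for i in range(m + 1):
        fail = (n - i) * lam
        if i == 0:
            rebuild = 0.0  # nothing to rebuild when no disk has failed
        elif policy == "serial":
            rebuild = mu
        else:
            rebuild = i * mu
        # First-passage recursion for a birth-death chain:
        # E[time i -> i + 1] = (1 + rebuild_i * E[time i - 1 -> i]) / fail_i
        hop = (1.0 + rebuild * hop) / fail
        mttdl += hop
    return mttdl


if __name__ == "__main__":
    # Hypothetical values: 8 disks, failure rate per hour, rebuild rate per hour.
    for policy in ("serial", "concurrent"):
        hours = traditional_mttdl(n=8, m=2, lam=1e-5, mu=0.1, policy=policy)
        print(policy, hours)
```

For m = 1 the two policies give the same value, consistent with the statement that the serial and concurrent models are identical for single disk fault tolerant systems. For m ≥ 2 the numbers inherit the modeling error described above, so they should be read as what the traditional models predict, not as accurate reliability estimates.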
Similar resources
On the Workload Allocation Problem of Short Unreliable Production Lines with Finite Buffers
Serial flow or production lines are modeled as tandem queueing networks and formulated as continuous-time Markov chains to investigate how to maximize throughput or minimize the average work-in-process (WIP) when the total service time among the stations is fixed (this is the workload allocation problem). This paper examines the effect of the unreliability of the machines on the optimal worklo...
Modeling of Hybrid Production Systems with Constant WIP and Unreliable Equipment
Material flow in production systems can be controlled by a purely push-pull (just-in-time), or by a hybrid push-pull control mechanism. One type of push-pull production control can be implemented by controlling only the last stage during part withdrawals to trigger the production at the first stage. While the final stage is operated according to a pull mechanism, intermediate stages are operate...
Vacation model for Markov machine repair problem with two heterogeneous unreliable servers and threshold recovery
A Markov model of a multi-component machining system comprising two unreliable heterogeneous servers and a mixed type of standby support has been studied. The repair of broken-down machines follows a bi-level threshold policy for activating the servers. A server returns to perform repairs when the pre-specified workload of failed machines has built up. The first (seco...
Title of Dissertation: OPTIMAL PREVENTIVE MAINTENANCE POLICIES FOR UNRELIABLE QUEUEING AND PRODUCTION SYSTEMS
Title of Dissertation: OPTIMAL PREVENTIVE MAINTENANCE POLICIES FOR UNRELIABLE QUEUEING AND PRODUCTION SYSTEMS Xiaodong Yao, Doctor of Philosophy, 2003 Dissertation directed by: Professors Steve I. Marcus and Michael C. Fu Department of Electrical and Computer Engineering Preventive Maintenance (PM) models have traditionally concentrated on utilizing machine “technical” state information such as...